Spanish Elections!
Analyzing election data in Spain 2008-2019
Animated GIF
The practice will be based on the electoral data archives below, compiling data on elections to the Spanish Congress of Deputies from 2008 to the present, as well as surveys, municipalities codes and abbreviations.
First, we convert the data to tidy data: In both “election_data” and “surveys” databases, each party appears as a variable; so we create a new variable that unites all these parties so they appear as categories of a single variable “party”, and the values appear in a new variable called “ballots” and “estimated_voting” respectively.
Now we eliminate the non relevant parties (according to points 6 and 7) for “surveys_tidy” and “election_data_tidy”
For the “survey” database:
important_parties <- c(
"PSOE", "PP", "CS", "BNG", "EAJ-PNV", "CIU",
"PODEMOS", "IU", "UP", "EH-BILDU", "ERC", "VOX", "EE", "EA",
"EH", "PCE", "MP"
)
surveys_tidy1 <- surveys_tidy |>
mutate(party = case_when(
party %in% important_parties ~ party, # keep important parties
TRUE ~ "Other" # new category: other parties
)) |>
#Change some party names
mutate(party = case_when(
party == "PODEMOS" ~ "UNIDAS PODEMOS-IU",
party == "IU" ~ "UNIDAS PODEMOS-IU",
party == "UP" ~ "UNIDAS PODEMOS-IU",
party == "PCE" ~ "PODEMOS-IU",
party == "MUC" ~ "PODEMOS-IU", # party integrated into IU but minority
party == "ICV" ~ "PODEMOS-IU", # Verdes in Valencia go with "Podemos"
party == "EE" ~ "PSOE", # PSOE in País Vasco
party == "EA" ~ "EH-BILDU",
party == "EH" ~ "EH-BILDU",
party == "COMPROMIS" ~ "PODEMOS-IU",
party == "JXCAT-JUNTS" ~ "CIU",
party == "PDECAT" ~ "CIU",
TRUE ~ party # keep the rest without changes
))We check which parties we have now in the first surveys_tidy
[1] "Other" "PSOE" "PODEMOS-IU"
[4] "CIU" "EAJ-PNV" "ERC"
[7] "UNIDAS PODEMOS-IU" "EH-BILDU" "PP"
[10] "BNG" "CS" "VOX"
[13] "MP"
We do the same for the “election_data_tidy” database.
In this case we don’t do it manually as before since each party has many branches, federations and coalitions. That is why we use the grepl() code. We also make sure that it groups all the parties regardless of capital letters, accents or hyphens.
Additional notes:
Since we didn’t add Junts for Convergencia i Unio (as it wasn’t in the instructions), and the party dissolved in 2015, they don’t get votes from 2015 onwards.
We noticed that in election data, Más País only has votes for the november 2019 elections, that’s why it doesn’t appear in other elections after we clean the data.
important_parties2 <- c("PODEMOS", "ARALAR", "EUSKO ALKARTASUNA", "EN MAREA", "EQUO", "ESQUERRA REPUBLICANA") # exceptions that we add manually
election_data_tidy1 <- election_data_tidy |>
mutate(party = case_when(
party %in% important_parties2 ~ party,
grepl("PARTIDO NACIONALISTA VASCO", party, ignore.case = TRUE) ~ "PARTIDO NACIONALISTA VASCO",
grepl("BLOQUE NACIONALISTA GALEGO", party, ignore.case = TRUE) ~ "BLOQUE NACIONALISTA GALEGO",
grepl("VOX", party, ignore.case = TRUE) ~ "VOX",
grepl("ESQUERRA REPUBLICANA DE CATALUNYA", party, ignore.case = TRUE) ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
grepl("CONVERGENCIA i UNIO", stri_trans_general(party, "Latin-ASCII"), ignore.case = TRUE) ~ "CONVERGÈNCIA i UNIÓ",
grepl("MAS PAIS", stri_trans_general(party, "Latin-ASCII"), ignore.case = TRUE) ~ "MÁS PAÍS",
grepl("PODEMOS|UNIDAS PODEMOS|UNIDOS PODEMOS|IZQUIERDA UNIDA|UNITS PODEM|XUNIES PODEMOS", stri_trans_general(party, "Latin-ASCII"), ignore.case = TRUE) ~ "PODEMOS-IU",
grepl("PARTIDO POPULAR|PARTIT POPULAR|PP", party, ignore.case = TRUE) ~ "PARTIDO POPULAR",
grepl("PARTIDO SOCIALISTA|PSOE", party, ignore.case = TRUE) ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
grepl("PARTIDO DE LA CIUDADANIA", stri_trans_general(party, "Latin-ASCII"), ignore.case = TRUE) ~ "CIUDADANOS PARTIDO DE LA CIUDADANÍA",
grepl ("EUSKAL HERRIA BILDU", party, ignore.case = TRUE) ~ "EUSKAL HERRIA BILDU",
TRUE ~ "Other"
))
election_data_tidy2 <- election_data_tidy1 |>
mutate(party = case_when(
party == "PODEMOS" ~ "PODEMOS-IU",
party == "EN MAREA" ~ "PODEMOS-IU",
party == "EQUO" ~ "PODEMOS-IU",
party == "EUSKAL HERRIA BILDU (EH Bildu)" ~ "EH BILDU",
party == "EUSKAL HERRIA BILDU" ~ "EH BILDU",
party == "ESQUERRA REPUBLICANA" ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
party == "ARALAR" ~ "EH BILDU",
party == "EUSKO ALKARTASUNA" ~ "EH BILDU",
TRUE ~ party # keep the rest without changes
))
unique(election_data_tidy2$party) # to check the results [1] "Other" "EH BILDU"
[3] "PARTIDO POPULAR" "PARTIDO SOCIALISTA OBRERO ESPAÑOL"
[5] "PARTIDO NACIONALISTA VASCO" "CIUDADANOS PARTIDO DE LA CIUDADANÍA"
[7] "PODEMOS-IU" "CONVERGÈNCIA i UNIÓ"
[9] "ESQUERRA REPUBLICANA DE CATALUNYA" "BLOQUE NACIONALISTA GALEGO"
[11] "VOX" "MÁS PAÍS"
Animated GIF
We eliminate those columns that don’t provide extra information where there is no variance. We apply this process to the “election” and “survey” data.
Also, according to point 11, we have to discard some polls, so we filter:
surveys_tidy2 <- surveys_tidy1 |>
mutate(
date_elec = ymd(date_elec),
field_date_from = ymd(field_date_from),
field_date_to = ymd(field_date_to),
fieldwork_days = as.numeric(difftime(field_date_to, field_date_from, units = "days")) # we calculate the fieldwork days
) |>
filter(
date_elec >= "2008-01-01", # elections >= 2008
exit_poll == "FALSE", # no exit polls
size >= 750, # size >= 750 and known
fieldwork_days > 1 # fieldwork days > 1
) |>
select (-type_survey) # discard unimportant columnsPossible joins that may be useful:
election_data_tidy_new <- election_data_tidy3 |>
mutate(cod_mun_full = paste(codigo_ccaa, codigo_provincia, codigo_municipio, sep = "-"))
#the same with the "glue" library:
#election_data_tidy_new <- election_data_tidy3 |>
#mutate(cod_mun_full = glue("{codigo_ccaa}-{codigo_provincia}-{codigo_municipio}"))
# left join to have the names of municipalities
election_data_tidy_new <- election_data_tidy_new |>
left_join(cod_mun, by = c("cod_mun_full" = "cod_mun"))Now we have a table with the info from election_data and cod_mun.
First, we have to clean the names from “abbrev” database:
abbrev_clean <- abbrev |>
mutate(denominacion = case_when(
denominacion %in% important_parties2 ~ denominacion, # same parties as before
grepl("PARTIDO NACIONALISTA VASCO", denominacion, ignore.case = TRUE) ~ "PARTIDO NACIONALISTA VASCO",
grepl("BLOQUE NACIONALISTA GALEGO", denominacion, ignore.case = TRUE) ~ "BLOQUE NACIONALISTA GALEGO",
grepl("VOX", denominacion, ignore.case = TRUE) ~ "VOX",
grepl("ESQUERRA REPUBLICANA DE CATALUNYA", denominacion, ignore.case = TRUE) ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
grepl("CONVERGENCIA i UNIO", stri_trans_general(denominacion, "Latin-ASCII"), ignore.case = TRUE) ~ "CONVERGÈNCIA i UNIÓ",
grepl("MAS PAIS", stri_trans_general(denominacion, "Latin-ASCII"), ignore.case = TRUE) ~ "MÁS PAÍS",
grepl("PODEMOS|UNIDAS PODEMOS|UNIDOS PODEMOS|IZQUIERDA UNIDA|UNITS PODEM|XUNIES PODEMOS", stri_trans_general(denominacion, "Latin-ASCII"), ignore.case = TRUE) ~ "PODEMOS-IU",
grepl("PARTIDO POPULAR|PARTIT POPULAR|PP", denominacion, ignore.case = TRUE) ~ "PARTIDO POPULAR",
grepl("PARTIDO SOCIALISTA|PSOE", denominacion, ignore.case = TRUE) ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
grepl("PARTIDO DE LA CIUDADANIA", stri_trans_general(denominacion, "Latin-ASCII"), ignore.case = TRUE) ~ "CIUDADANOS PARTIDO DE LA CIUDADANÍA",
grepl ("EUSKAL HERRIA BILDU", denominacion, ignore.case = TRUE) ~ "EUSKAL HERRIA BILDU",
TRUE ~ "Other"
))
abbrev_clean2 <- abbrev_clean |>
mutate(denominacion = case_when(
denominacion == "PODEMOS" ~ "PODEMOS-IU",
denominacion == "EN MAREA" ~ "PODEMOS-IU",
denominacion == "EQUO" ~ "PODEMOS-IU",
denominacion == "EUSKAL HERRIA BILDU (EH Bildu)" ~ "EH BILDU",
denominacion == "EUSKAL HERRIA BILDU" ~ "EH BILDU",
denominacion == "ESQUERRA REPUBLICANA" ~ "ESQUERRA REPUBLICANA DE CATALUNYA",
denominacion == "ARALAR" ~ "EH BILDU",
denominacion == "EUSKO ALKARTASUNA" ~ "EH BILDU",
TRUE ~ denominacion # keep the rest without changes
))
unique(abbrev_clean2$denominacion) #To check the results [1] "Other" "PARTIDO NACIONALISTA VASCO"
[3] "PODEMOS-IU" "EH BILDU"
[5] "PARTIDO POPULAR" "PARTIDO SOCIALISTA OBRERO ESPAÑOL"
[7] "ESQUERRA REPUBLICANA DE CATALUNYA" "CONVERGÈNCIA i UNIÓ"
[9] "BLOQUE NACIONALISTA GALEGO" "CIUDADANOS PARTIDO DE LA CIUDADANÍA"
[11] "VOX" "MÁS PAÍS"
# change siglas
abbrev_clean2 <- abbrev_clean2 |>
mutate(siglas = case_when(
denominacion == "PARTIDO SOCIALISTA OBRERO ESPAÑOL" ~ "PSOE",
denominacion == "PARTIDO POPULAR" ~ "PP",
denominacion == "PODEMOS-IU" ~ "PODEMOS-IU",
denominacion == "EH BILDU" ~ "EH-BILDU",
denominacion == "ESQUERRA REPUBLICANA DE CATALUNYA" ~ "ERC",
denominacion == "BLOQUE NACIONALISTA GALEGO" ~ "BNG",
denominacion == "VOX" ~ "VOX",
denominacion == "MÁS PAÍS" ~ "MP",
denominacion == "PARTIDO NACIONALISTA VASCO" ~ "EAJ-PNV",
denominacion == "CIUDADANOS PARTIDO DE LA CIUDADANÍA" ~ "CS",
denominacion == "Other" ~ "Other",
denominacion == "CONVERGÈNCIA i UNIÓ" ~ "CIU",
TRUE ~ siglas
))
unique(abbrev_clean2$siglas) [1] "Other" "EAJ-PNV" "PODEMOS-IU" "EH-BILDU" "PP"
[6] "PSOE" "ERC" "CIU" "BNG" "CS"
[11] "VOX" "MP"
Then we do the join, which we can do both with the survey and the election data, so we have a more complete databases. We also reorder and rename some of the new variables so it’s clearer:
# election data
final_election_data <- election_data_tidy_new |>
left_join(abbrev_clean2_unique, by = c("party" = "denominacion"))
final_election_data <- final_election_data |>
relocate(siglas, .after = party)
# survey data
final_survey_data <- surveys_tidy2 |>
left_join(abbrev_clean2_unique, by = c("party" = "siglas"))
final_survey_data <- final_survey_data |>
relocate(denominacion, .before = party) |>
rename(
abbrev = party,
party = denominacion
)
final_survey_data <- final_survey_data |>
mutate(party = if_else(is.na(party), "Other", party))Animated GIF
Which party was the winner in the municipalities with more than 100,000 inhabitants (census) in each of the elections?
# we filter the large municipalities
large_municipalities <- final_election_data |>
filter(censo > 100000)
# we sum up the total votes grouping by municipality, party, year and month
votes_by_party <- large_municipalities |>
group_by(municipio, party, anno, mes) |>
summarise(total_votes = sum(ballots, na.rm = TRUE), .groups = "drop")
# we check the winners by municipality
winners_by_municipality <- votes_by_party |>
group_by(municipio, anno, mes) |>
filter(total_votes == max(total_votes)) |>
ungroup()
print(winners_by_municipality) # A tibble: 282 × 5
municipio party anno mes total_votes
<chr> <chr> <dbl> <chr> <dbl>
1 Albacete PARTIDO POPULAR 2008 03 49909
2 Albacete PARTIDO POPULAR 2011 11 55858
3 Albacete PARTIDO POPULAR 2015 12 36149
4 Albacete PARTIDO POPULAR 2016 06 38470
5 Albacete PARTIDO SOCIALISTA OBRERO ESPAÑOL 2019 04 28729
6 Albacete PARTIDO SOCIALISTA OBRERO ESPAÑOL 2019 11 27074
7 Alcalá de Henares PARTIDO POPULAR 2008 03 48087
8 Alcalá de Henares PARTIDO POPULAR 2011 11 46667
9 Alcalá de Henares PARTIDO POPULAR 2015 12 30927
10 Alcalá de Henares PARTIDO POPULAR 2016 06 33949
# ℹ 272 more rows
Which party was the second when the first was the PSOE? And when the first was the PP?
When PSOE was first:
# first we do the same as Q1 but for all municipalities
votes_by_party_all <- final_election_data |>
group_by(municipio, siglas, anno, mes) |>
summarise(total_votes = sum(ballots, na.rm = TRUE), .groups = "drop")
winners_by_municipality_all <- votes_by_party_all |>
group_by(municipio, anno, mes) |>
filter(total_votes == max(total_votes)) |>
ungroup()
# we extract the second when PSOE was first
psoe_winner <- winners_by_municipality_all |> filter(siglas == "PSOE")
second_place_psoe_first <- votes_by_party_all |> filter(municipio %in% psoe_winner$municipio) |>
filter(!(siglas == "PSOE")) |>
group_by(municipio, anno, mes) |>
filter(total_votes == max(total_votes)) |>
ungroup()
results_psoe_winner <- psoe_winner |>
rename(winner_party = siglas, winner_votes = total_votes) |>
left_join(
second_place_psoe_first |>
rename(second_party = siglas, second_votes = total_votes),
by = c("municipio", "anno", "mes")
) |>
relocate(anno, mes, .after = municipio)
results_psoe_winner # we see the results in column "second_party"# A tibble: 14,113 × 7
municipio anno mes winner_party winner_votes second_party second_votes
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl>
1 Ababuj 2019 04 PSOE 13 PP 13
2 Abadiño 2008 03 PSOE 1243 EAJ-PNV 1050
3 Abadía 2008 03 PSOE 148 PP 72
4 Abadía 2011 11 PSOE 126 PP 89
5 Abadía 2015 12 PSOE 114 PP 60
6 Abadía 2016 06 PSOE 111 PP 68
7 Abadía 2019 04 PSOE 128 PP 34
8 Abadía 2019 11 PSOE 112 PP 31
9 Abadía 2019 11 PSOE 112 VOX 31
10 Abajas 2008 03 PSOE 14 PP 9
# ℹ 14,103 more rows
When PP was first:
# we extract the second when PP was first in the same way as before
pp_winner <- winners_by_municipality_all |> filter(siglas == "PP")
second_place_pp_first <- votes_by_party_all |> filter(municipio %in% pp_winner$municipio) |>
filter(!(siglas == "PP")) |>
group_by(municipio, anno, mes) |>
filter(total_votes == max(total_votes)) |>
ungroup()
results_pp_winner <- pp_winner |> rename(winner_party = siglas, winner_votes = total_votes) |>
left_join(
second_place_pp_first |>
rename(second_party = siglas, second_votes = total_votes),
by = c("municipio", "anno", "mes")
) |>
relocate(anno, mes, .after = municipio)
results_pp_winner #Results in column "Second party"# A tibble: 26,531 × 7
municipio anno mes winner_party winner_votes second_party second_votes
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl>
1 Ababuj 2008 03 PP 27 PSOE 25
2 Ababuj 2011 11 PP 28 PSOE 10
3 Ababuj 2015 12 PP 22 PODEMOS-IU 14
4 Ababuj 2016 06 PP 26 PODEMOS-IU 10
5 Ababuj 2019 04 PP 13 PSOE 13
6 Ababuj 2019 11 PP 12 Other 11
7 Abades 2008 03 PP 363 PSOE 229
8 Abades 2011 11 PP 353 PSOE 146
9 Abades 2015 12 PP 276 PSOE 153
10 Abades 2016 06 PP 289 PSOE 115
# ℹ 26,521 more rows
Now we see it in a graph to see the results more clearly for each year. When PSOE first
second_party_year_psoe_summary <- results_psoe_winner |>
group_by(second_party, anno) |>
summarise(count = n(), .groups = "drop")
ggplot(second_party_year_psoe_summary, aes(x = factor(anno), y = count, fill = second_party)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("BNG" = "pink", "CS" = "orange",
"CIU" = "darkred", "EH-BILDU" = "darkgreen", "ERC" = "yellow", "Other" = "grey", "EAJ-PNV" = "lightblue", "PP" = "blue", "PODEMOS-IU" = "purple", "VOX" = "green")) +
labs(title = "Distribution of second places by year when PSOE wins",
x = "Year",
y = "Number of municipalities",
fill = "Party") +
theme_minimal()A stacked bar graph for each year is a good option since we can see the differences between the most voted parties after the PSOE and at the same time we see how it has changed over the years/elections.
We can check how PP has always been the most voted after PSOE, although since 2015 when Podemos-IU appears and the two-party system begins to break down and other parties start to appear. Especially in 2019 we see more fracture, since VOX seems to be the third most voted, followed by Podemos and Ciudadanos.
Animated GIF
When PP first
second_party_year_pp_summary <- results_pp_winner |>
group_by(second_party, anno) |>
summarise(count = n(), .groups = "drop")
ggplot(second_party_year_pp_summary, aes(x = factor(anno), y = count, fill = second_party)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("BNG" = "pink", "CS" = "orange",
"CIU" = "darkred", "EH-BILDU" = "darkgreen", "ERC" = "yellow", "Other" = "grey", "EAJ-PNV" = "lightblue", "PP" = "blue", "PODEMOS-IU" = "purple", "VOX" = "green", "PSOE" = "red")) +
labs(title = "Distribution of second places by year when PP wins",
x = "Year",
y = "Number of municipalities",
fill = "Party") +
theme_minimal()Here we see much more diversity than when the PSOE was first, since people have been voting for a greater diversity of parties since 2008. This can be explained because the left is much more divided than the right and the alternative to the PP was not only PSOE but other parties. However, as before, since 2015 we see other parties appear that are voted in large proportion, such as Podemos and Ciudadanos, and in 2019 also VOX.
Who benefits from low turnout?
First of all, what is a low turnout? It makes reference to the electoral districts where the electoral participation is low. We consider low turnout <0,6 based on the turnout tendency in Spain (We can not compare with other countries because in some like USA 60% is moderate/high and in other like Sweden 60% is very very low). So we compare participatory trend in Spain during democracy, where the average participation rate in general elections in Spain is 72.1%, so <70% would be moderate and lower than usual and <60% would be low. (20 Minutos, 2023)
# we calculate participation and we classify it
turnout_election_data <- final_election_data|>
mutate(turnout = (participacion_1 + participacion_2) / censo) |>
mutate(turnout_bin = ifelse(turnout < 0.6, "Low Turnout", "High Turnout"))
# function to analyze the effect of low participation:
analyze_low_turnout <-
function(data, turnout_col = "turnout", ballots_col = "ballots", censo_col = "censo", party_col = "siglas") {
# calculate vote share
data <- data |>
group_by(codigo_municipio) |>
mutate(vote_share = !!sym(ballots_col) / !!sym(censo_col))
# calculate turnout and bin it into low and high categories
data <- data |>
mutate(turnout_bin = ifelse(!!sym(turnout_col) < 0.6,
"Low Turnout",
"High Turnout"))
# Summarize vote share by turnout category and party
summary <- data |>
group_by(!!sym(party_col), turnout_bin) |>
summarise(avg_vote_share = mean(vote_share, na.rm = TRUE), .groups = "drop") |>
arrange(desc(avg_vote_share))
return(summary)
}# A tibble: 24 × 3
siglas turnout_bin avg_vote_share
<chr> <chr> <dbl>
1 PP High Turnout 0.284
2 CIU High Turnout 0.260
3 EAJ-PNV High Turnout 0.248
4 CIU Low Turnout 0.248
5 PP Low Turnout 0.245
6 PSOE High Turnout 0.230
7 ERC High Turnout 0.171
8 PSOE Low Turnout 0.168
9 EAJ-PNV Low Turnout 0.163
10 EH-BILDU High Turnout 0.143
# ℹ 14 more rows
Visualization of the results:
ggplot(turnout_analysis, aes(x = siglas, y = avg_vote_share, fill = turnout_bin)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Vote Share by Turnout Level and Party",
x = "Party",
y = "Average Vote Share (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))The parties that benefit from low turnout are: Podemos-IU, VOX and Other parties, although it’s not a big difference.
How to analyze the relationship between census and vote? Is it true that certain parties win in rural areas?
In Spain is considered “rural” a population of <30000 inhabitants according to Ley 45/2007, de 13 de diciembre, «BOE» núm. 299, de 14 de diciembre de 2007, páginas 51339 a 51349 (11 págs.)
# classify rural vs urban
election_data_areas <- final_election_data |>
mutate(area_type = case_when(
censo < 30000 ~ "Rural",
TRUE ~ "Urban"
)) |>
group_by(area_type, siglas) |>
summarise(total_votes = sum(ballots, na.rm = TRUE)) |>
mutate(vote_share = total_votes / sum(total_votes))
# proportion of votes per party in each region
vote_share_table <- election_data_areas |>
group_by(siglas) |>
mutate(total_votes_party = sum(total_votes)) |>
ungroup() |>
mutate(vote_share_adjusted = total_votes / total_votes_party) |>
select(siglas, area_type, vote_share_adjusted) |>
pivot_wider(names_from = area_type, values_from = vote_share_adjusted) |>
mutate(Rural = Rural * 100, Urban = Urban * 100) |>
select(siglas, Rural, Urban)
vote_share_table# A tibble: 12 × 3
siglas Rural Urban
<chr> <dbl> <dbl>
1 BNG 65.6 34.4
2 CIU 51.8 48.2
3 CS 39.2 60.8
4 EAJ-PNV 52.2 47.8
5 EH-BILDU 64.6 35.4
6 ERC 53.1 46.9
7 MP 21.4 78.6
8 Other 44.8 55.2
9 PODEMOS-IU 41.4 58.6
10 PP 45.5 54.5
11 PSOE 48.5 51.5
12 VOX 43.2 56.8
We express the results in a plot:
# convert table to long format for ggplot
vote_share_long <- vote_share_table |>
pivot_longer(cols = c(Rural, Urban), names_to = "area_type", values_to = "vote_share")
# plot
ggplot(vote_share_long, aes(x = siglas, y = vote_share, fill = area_type)) +
geom_bar(stat = "identity", position = "stack") +
scale_fill_manual(values = c("Rural" = "#8FCB9B", "Urban" = "#4A90E2")) +
labs(title = "Vote Share by Party and Area Type (100%)",
x = "Party",
y = "Vote Share (%)",
fill = "Area Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))We chose a stacked bar graph to see for each party how many of its voters are from the urban area and the rural area and we see the following: There are parties that stand out for having more votes from urban areas, such as Más País and Ciudadanos, perhaps explained by the policies that the have more focused on city problems (climate change, modernization, etc.).
On the other hand, there are parties with a higher percentage of voters from rural areas, such as BNG, EH-Bildu, CiU and ERC. This may be due to its focus on defending regional identity, autonomy and local development, topics that resonate especially in these areas. In addition, his opposition to national parties and his historical connection with rural communities reinforce his support.
How to calibrate the error of the polls (remember that the polls are voting intentions at national level)?
# first, let's compute the total votes for each party in each election
total_votes_by_party <- final_election_data |>
group_by(anno, mes, party) |>
summarize(total_ballots = sum(ballots, na.rm = TRUE), .groups = "drop") |>
group_by(anno, mes) |>
mutate(total_ballots_election = sum(total_ballots, na.rm = TRUE)) |>
ungroup() |>
mutate(real_percentage = total_ballots / total_ballots_election) |>
select(anno, mes, party, real_percentage) |>
mutate(mes = as.integer(mes)) # Convert mes to integer (there was a problem without it before)
# now, let's average survey data for each party for each election
survey_estimates <- final_survey_data |>
filter(!is.na(estimated_voting)) |>
group_by(date_elec, party) |>
summarize(mean_estimated_voting = mean(estimated_voting, na.rm = TRUE), .groups = "drop")|>
mutate(anno = year(date_elec), mes = month(date_elec)) |>
select(anno, mes, party, mean_estimated_voting)
# now we'll merge the two datasets (election results and survey estimates)
# before, we change some party names so they fit
survey_estimates <- survey_estimates |>
mutate(party = case_when(
party == "CIUDADANOS PARTIDO DE LA CIUDADANÍA" ~ "CIUDADANOS PARTIDO DE LA CIUDADANÍA",
party == "PARTIDO POPULAR" ~ "PARTIDO POPULAR",
party == "PARTIDO SOCIALISTA OBRERO ESPAÑOL" ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
party == "UNIDAS PODEMOS-IU" ~ "PODEMOS-IU",
party == "VOX" ~ "VOX",
party == "EH-BILDU" ~ "EH BILDU",
TRUE ~ party
))
# we merge
poll_error_data <- left_join(total_votes_by_party, survey_estimates, by = c("anno", "mes", "party"))
# calculate the absolute error for each survey for each party
poll_error_data <- poll_error_data |>
mutate(error_percentage = abs(real_percentage - (mean_estimated_voting/100)))
# Print the data with the error (see the table below)
print(poll_error_data)# A tibble: 72 × 6
anno mes party real_percentage mean_estimated_voting error_percentage
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 2008 3 BLOQUE NA… 0.00835 1.08 0.00248
2 2008 3 CIUDADANO… 0.00182 NA NA
3 2008 3 CONVERGÈN… 0.0309 3.28 0.00196
4 2008 3 EH BILDU 0.00250 NA NA
5 2008 3 ESQUERRA … 0.0116 2.33 0.0118
6 2008 3 MÁS PAÍS 0 NA NA
7 2008 3 Other 0.127 4.53 0.0812
8 2008 3 PARTIDO N… 0.0121 1.79 0.00581
9 2008 3 PARTIDO P… 0.406 38.4 0.0218
10 2008 3 PARTIDO S… 0.375 42.2 0.0472
# ℹ 62 more rows
The code calculates the polling error by comparing real election results with average survey estimates. It first computes the actual vote percentages (real_percentage) from election data by dividing party ballots by total ballots. Then, it calculates the average predicted vote percentages (mean_estimated_voting) from survey data. The datasets are merged, and the absolute error is computed as the difference between real percentages and polling predictions. The result is a tibble showing errors for each party in each election.
Example: In 2008, BNG had a real percentage of ~0.835% and a polling estimate of 1.08%, resulting in an error of ~0.00248.
Which polling houses got it right the most and which ones deviated the most from the results?
Most Accurate Polling Houses:
# Calculate real vote percentages from election data
real_votes <- final_election_data |>
group_by(anno, mes, party) |>
summarize(total_ballots = sum(ballots, na.rm = TRUE), .groups = "drop") |>
group_by(anno, mes) |>
mutate(total_ballots_election = sum(total_ballots, na.rm = TRUE)) |>
ungroup() |>
mutate(real_percentage = total_ballots / total_ballots_election) |>
select(anno, mes, party, real_percentage) |>
mutate(mes = as.integer(mes)) # Convert mes to integer
# join real vote percentages with survey data
# adjust party names in survey data to match election data before the join
final_survey_data_adjusted <- final_survey_data |>
mutate(party = case_when(
party == "CIUDADANOS PARTIDO DE LA CIUDADANÍA" ~ "CIUDADANOS PARTIDO DE LA CIUDADANÍA",
party == "PARTIDO POPULAR" ~ "PARTIDO POPULAR",
party == "PARTIDO SOCIALISTA OBRERO ESPAÑOL" ~ "PARTIDO SOCIALISTA OBRERO ESPAÑOL",
party == "UNIDAS PODEMOS-IU" ~ "PODEMOS-IU",
party == "VOX" ~ "VOX",
party == "EH-BILDU" ~ "EH BILDU",
TRUE ~ party
)) |>
mutate(mes = month(date_elec), anno= year(date_elec))
survey_with_real <- final_survey_data_adjusted |>
left_join(real_votes, by = c("anno", "mes", "party"))# calculate the error for each pollster
pollster_error <- survey_with_real |>
group_by(pollster) |>
summarize(average_error = mean(abs(estimated_voting - (real_percentage * 100)), na.rm = TRUE), .groups = "drop")
# identify the pollsters with the smallest average error (most accurate)
best_pollsters <- pollster_error |>
arrange(average_error)
# identify the pollsters with the biggest average error (least accurate)
worst_pollsters <- pollster_error |>
arrange(desc(average_error))
print(head(best_pollsters))# A tibble: 6 × 2
pollster average_error
<chr> <dbl>
1 OPINA 2.61
2 METRA SEIS 2.70
3 VOX PÚBLICA 3.09
4 TOP POSITION 3.16
5 GALLUP 3.18
6 IBERCONSULTA 3.26
The Most Accurate Polling Houses were: OPINA, METRA SEIS and VOX PÚBLICA, with average errors of +/- 2-3% meaning that pollster’s estimates were, on average, 2-3 percentage points different from the actual results. This error refers to a general average over the years we have considered in our data set (2008-2019)
Less Accurate Polling Houses
# A tibble: 6 × 2
pollster average_error
<chr> <dbl>
1 TÁBULA V 6.01
2 IKERFEL 5.85
3 MYWORD 5.81
4 SYM CONSULTING 5.65
5 DEMOMÉTRICA 5.37
6 CELESTE-TEL 5.33
And the worst were: TÁBULA V, IKERFEL and MYWORD where the pollster’s estimates were, approximately, 6 percentage points different from the actual results.
Participation analysis: How does participation at the polls evolve over the years (2008-2019)?
# prepare Election data
election_data_turnout <- final_election_data |>
mutate(election_year = year(make_date(anno, mes))) # extract election year
# calculate Total census and Votes per year
turnout_by_year <- election_data_turnout |>
group_by(election_year) |>
summarise(
total_census = sum(censo, na.rm = TRUE), # calculate total census
total_votes = sum(votos_candidaturas, na.rm = TRUE) + sum(votos_blancos, na.rm = TRUE), # calculate total valid votes
.groups = "drop"
)
# calculate Turnout Percentage
turnout_by_year <- turnout_by_year |>
mutate(turnout_percentage = (total_votes / total_census) * 100)
print(turnout_by_year)# A tibble: 5 × 4
election_year total_census total_votes turnout_percentage
<dbl> <dbl> <dbl> <dbl>
1 2008 15446544456 11562392088 74.9
2 2011 15639297480 11069819064 70.8
3 2015 15791474256 11436300336 72.4
4 2016 15776182752 10913638608 69.2
5 2019 31769422872 22890157128 72.1
Now we visualize the results:
ggplot(turnout_by_year, aes(x = election_year, y = turnout_percentage)) +
geom_line(color = "darkblue", linewidth = 1.2) +
geom_point(color = "darkblue", size = 3) +
geom_text(aes(label = sprintf("%.1f%%", turnout_percentage), # add text labels with percentage
vjust = ifelse(election_year %in% c(2011, 2016), 1.5, -1)), # adjust vjust for 2011 and 2016
size = 3, color = "darkblue") +
scale_x_continuous(breaks = unique(turnout_by_year$election_year)) +
scale_y_continuous(limits = c(65, 80), breaks = seq(65, 80, 5), expand = c(0,0)) +
labs(title = "Turnout Percentage in Spanish Congressional Elections (2008-2019)",
x = "Election Year",
y = "Turnout Percentage") +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "gray", linewidth = 0.3),
panel.grid.minor.y = element_blank())In the 2008 elections, participation reached 75%, which is high (above the Spanish average) and similar to that of the first democratic elections in Spain (1977), which reflected strong citizen commitment and consolidated democracy. However, in 2011 participation fell to 70.8%, which can be explained in part by the impact of the economic, social and political crisis that affected the country, generating distrust and uncertainty among citizens. In 2015, turnout rose almost 2 percentage points, probably driven by the emergence of new political actors such as Podemos, which offered an alternative to the traditional two-party system, sparking the interest and hope of many voters. However, in 2016 participation fell again, reaching the lowest level since 2008 (69.2%), reflecting possible disenchantment or political fatigue. In 2019, participation rebounded again, but did not reach the high levels of 2008, suggesting a partial recovery of electoral interest, although still far from the participation peaks observed in previous times.
Which parties received the most votes in each province from 2008 to the 2019 elections?
(Important: the number of votes is not equal to the final result translated into seats)
# we extract the party with the most votes in each province
resultado_max_votos <- final_election_data |>
filter(anno %in% c(2008, 2011, 2015, 2016, 2019)) |>
group_by(anno, codigo_provincia, siglas) |>
summarise(total_votos = sum(ballots, na.rm = TRUE)) |>
slice_max(order_by = total_votos, n = 1)
print(resultado_max_votos)# A tibble: 260 × 4
# Groups: anno, codigo_provincia [260]
anno codigo_provincia siglas total_votos
<dbl> <chr> <chr> <dbl>
1 2008 01 PSOE 69179
2 2008 02 PP 113317
3 2008 03 PP 488333
4 2008 04 PP 161366
5 2008 05 PP 66747
6 2008 06 PSOE 222269
7 2008 07 PSOE 207644
8 2008 08 Other 1514486
9 2008 09 PP 116241
10 2008 10 PSOE 139487
# ℹ 250 more rows
In order to see the results clearly we need a base map which we take from here: https://centrodedescargas.cnig.es/CentroDescargas/catalogo.do?Serie=CAANE , especifically Cartografía Base de España del Atlas Nacional de España 1:10.000.000 SHAPE 2006 - 2024
Reading layer `se89_10_admin_prov_a_x' from data source
`C:\Users\laura\OneDrive\Escritorio\Máster\Semester 1\Data Programming\GROUP TASK\group\SIANE_CARTO_BASE_S_10M\vigente\se89_10_admin_prov_a_x.shp'
using driver `ESRI Shapefile'
Simple feature collection with 50 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -9.298027 ymin: 35.25069 xmax: 4.314508 ymax: 43.77594
Geodetic CRS: ETRS89
mapa_con_votos <- provincias |>
left_join(resultado_max_votos, by = c("id_prov" = "codigo_provincia"))
# colors for parties
siglas_colors <- c(
"BNG" = "skyblue",
"CS" = "#EB6109",
"CIU" = "#F7D117",
"EH BILDU" = "lightgreen",
"ERC" = "#FFB000",
"MP" = "#00B1E0",
"Other" = "#D3D3D3",
"EAJ-PNV" = "#228B22",
"PP" = "#1D84CE",
"PSOE" = "#FF0000",
"PODEMOS-IU" = "#6A097D",
"VOX" = "#5BC035"
)
# order in the legend
mapa_con_votos$siglas <- factor(mapa_con_votos$siglas, levels = c("PP", "PSOE", "PODEMOS-IU", "EAJ-PNV", "Other"))
mapa_con_votos$siglas[is.na(mapa_con_votos$siglas)] <- "Other"
# final map
ggplot(mapa_con_votos) +
geom_sf(aes(fill = siglas)) +
scale_fill_manual(values = siglas_colors) +
facet_wrap(~anno) +
theme_minimal() +
labs(title = "Parties with the most votes per province in general elections in Spain (2008-2019)",
fill = "Party") +
theme(
plot.title = element_text(size = 14, hjust = 0.5),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
panel.grid = element_blank(),
legend.position = c(0.95, 0.05),
legend.justification = c(1, 0),
legend.direction = "vertical"
)The parties with the most votes in almost the last two decades have been PP and PSOE, the great parties of the two-party system. In 2008, there was more balance between the parties voted in the provinces, but finally PSOE won. That explains the result in the following elections (2011) where there was a strong reaction from the right causing the PP to win. The PP also consolidated itself in the 2016 elections. It should be highlighted that in the 2015 elections no party obtained a majority to govern and no agreements were reached, which is why elections were repeated the following year. In 2019, the left reemerged and the PSOE won again** after the motion of censure against Rajoy (PP) the previous year (2018).
Furthermore, we see that there is no consistency throughout the whole peninsula, since we see over the years that in the southwest (part of Andalusia) people generally vote more for the PSOE, part of Castilla y León vote more PP, in Catalonia, other parties and in the Basque Country there’s more variety: Podemos, PSOE and PNV
Animated GIF
Analyze the proportionality of the vote in the 2019 November elections.
What we will do is study the proportion of votes obtained by each party at a national level, and then compare that to the proportion of “escaños” that each of those parties finally obtained. This will allow us to see which parties obtain a bigger proportion “escaños” compared to their proportion of votes and viceversa.
# First, let´s filter the votes by party so that we obtain only those for the elections of Novembers, 2019
votes_2019 <- votes_by_party_all |>
filter(anno == 2019, mes == 11)
# Now, let´s obtain de total of valid votes obtained in those elections in general, for all of the parties
national_total_2019 <- sum(votes_2019$total_votes)
# And, finally, let´s se the percentage of votes that each party obtained in a national level
votes_2019 <- votes_2019 |>
group_by(siglas) |>
summarise(votes_by_party_2019 = sum(total_votes)) |>
mutate(percentage_votes = (votes_by_party_2019 / national_total_2019) * 100)
votes_2019 <- votes_2019 |> rename(party = siglas)
votes_2019# A tibble: 12 × 3
party votes_by_party_2019 percentage_votes
<chr> <dbl> <dbl>
1 BNG 119597 0.501
2 CIU 0 0
3 CS 1637341 6.86
4 EAJ-PNV 377502 1.58
5 EH-BILDU 276535 1.16
6 ERC 869786 3.65
7 MP 370222 1.55
8 Other 3036267 12.7
9 PODEMOS-IU 2550852 10.7
10 PP 5021622 21.0
11 PSOE 5961901 25.0
12 VOX 3640377 15.3
# Now, let´s study the proportion of "escaños" obtained by each party. Our Congress has 350 "escaños":
escanos_total <- 350
party <- c("BNG", "CIU", "CS", "EAJ-PNV", "EH-BILDU", "ERC", "MP", "PODEMOS-IU", "PP", "PSOE", "VOX", "Other")
escanos_asigned <- c(1, 0, 10, 6, 5, 13, 3, 35, 89, 120, 26, 42)
percentage_escanos <- (escanos_asigned / escanos_total) * 100
# Again, let´s see the percentage of "escaños" that each of the parties obtained
results_2019_nov <- data.frame(
party = party,
escanos_asigned = escanos_asigned,
percentage_escanos = percentage_escanos
)
print(results_2019_nov) party escanos_asigned percentage_escanos
1 BNG 1 0.2857143
2 CIU 0 0.0000000
3 CS 10 2.8571429
4 EAJ-PNV 6 1.7142857
5 EH-BILDU 5 1.4285714
6 ERC 13 3.7142857
7 MP 3 0.8571429
8 PODEMOS-IU 35 10.0000000
9 PP 89 25.4285714
10 PSOE 120 34.2857143
11 VOX 26 7.4285714
12 Other 42 12.0000000
# Now, let´s compare the percentage of votes and the percentage of "escaños" by party
total_votes_2019 <- sum(votes_2019$votes_by_party_2019)
votes_2019$percentage_votes <- (votes_2019$votes_by_party_2019 / total_votes_2019) * 100
comparison <- merge(results_2019_nov, votes_2019[, c("party", "percentage_votes")], by = "party")
print(comparison) party escanos_asigned percentage_escanos percentage_votes
1 BNG 1 0.2857143 0.5012027
2 CIU 0 0.0000000 0.0000000
3 CS 10 2.8571429 6.8617084
4 EAJ-PNV 6 1.7142857 1.5820215
5 EH-BILDU 5 1.4285714 1.1588927
6 ERC 13 3.7142857 3.6450672
7 MP 3 0.8571429 1.5515127
8 Other 42 12.0000000 12.7242760
9 PODEMOS-IU 35 10.0000000 10.6900167
10 PP 89 25.4285714 21.0444287
11 PSOE 120 34.2857143 24.9849153
12 VOX 26 7.4285714 15.2559580
# And, finally, let´s take a look at a plot that will allow us to visualize this comparison
table_long <- pivot_longer(
comparison,
cols = c(percentage_escanos, percentage_votes),
names_to = "type",
values_to = "percentage"
)
table_long$party <- fct_reorder(table_long$party, table_long$percentage, .fun = max, .desc = TRUE)
ggplot(table_long, aes(x = party, y = percentage, fill = type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Comparison of percentage of escaños vs votes by party",
x = "Party",
y = "Percentage (%)",
fill = "Type"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5)
)As we can see, bigger parties such as PP and PSOE obtained a proportion of “escaños” bigger than their proportion of votes at a national level. Smaller parties tend to work the other way around; and nationalist parties, such as ERC, Bildu… they tend to be almost perfetly proportional, obtaining the same proportion of “escaños” than their proportion of votes. This is due to the fact that in Spain we use as electoral districts the provinces, and that our electoral formula is a proportional one. Basically, this distribution of the districts and the choice of using a proportional formula leads to the fact that, in provinces with smaller populations, each vote has “more value” than in more populated provinces. For example, a vote in Soria is more valuable than one in Madrid. This leads to the fact that the biggest parties, PP and PSOE, tend to obtain a bigger proportion of “escaños” than their real proportion of votes.
Animated GIF
Protest vote analysis (2008-2019)
# 1. Calculate Total Votes, Blank and Null Votes per Year
votes_by_year <- election_data_turnout |>
group_by(election_year) |>
summarise(
total_votes = sum(votos_candidaturas, na.rm = TRUE) + sum(votos_blancos, na.rm = TRUE) + sum(votos_nulos, na.rm = TRUE),
total_blank_votes = sum(votos_blancos, na.rm = TRUE),
total_null_votes = sum(votos_nulos, na.rm = TRUE),
.groups = "drop"
)
# 2. Calculate Percentages of Blank and Null Votes
votes_by_year <- votes_by_year |>
mutate(
blank_percentage = (total_blank_votes / total_votes) * 100,
null_percentage = (total_null_votes / total_votes) * 100
)
# 3. Create Blank Votes Plot
plot_blancos <- ggplot(votes_by_year, aes(x = election_year, y = blank_percentage)) +
geom_line(color = "darkgreen", linewidth = 1.2) +
geom_point(color = "darkgreen", size = 3) +
geom_text(aes(label = sprintf("%.1f%%", blank_percentage),
vjust = ifelse(election_year %in% c(2011, 2016), 1.5, -1)),
size = 3, color = "darkgreen") +
scale_x_continuous(breaks = unique(votes_by_year$election_year)) +
scale_y_continuous(limits = c(0, 3), breaks = seq(0, 3, 1), expand = c(0,0)) +
labs(title = "Blank Votes Percentage",
x = "Election Year",
y = "Percentage") +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "gray", linewidth = 0.3),
panel.grid.minor.y = element_blank())# 4. Create Null Votes Plot
plot_nulos <- ggplot(votes_by_year, aes(x = election_year, y = null_percentage)) +
geom_line(color = "darkred", linewidth = 1.2) +
geom_point(color = "darkred", size = 3) +
geom_text(aes(label = sprintf("%.1f%%", null_percentage),
vjust = ifelse(election_year %in% c(2008, 2016), 1.5, -1)),
size = 3, color = "darkred") +
scale_x_continuous(breaks = unique(votes_by_year$election_year)) +
scale_y_continuous(limits = c(0, 3), breaks = seq(0, 3, 1), expand = c(0,0)) +
labs(title = "Null Votes Percentage",
x = "Election Year",
y = "Percentage") +
theme_minimal() +
theme(panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(color = "gray", linewidth = 0.3),
panel.grid.minor.y = element_blank())
# 5. Combine and display plots
combined_plot <- plot_blancos + plot_nulos + plot_layout(ncol = 2) # Arrange vertically
print(combined_plot) # Display the combined plot2011 is the year with the most blank and null votes percentages, which could be attributed to the economic crisis and the subsequent discontent with politics, which we can attribute as well to the rise of new parties. The peaks in blank and null votes in 2011 align with widespread discontent during the economic crisis and the Movimiento 15M, which protested political corruption and austerity. This reflects frustration with traditional parties (PSOE and PP) and a lack of viable alternatives at the time. The decline in blank and null votes in 2015–2016 corresponds to the rise of Podemos and Ciudadanos, which offered new political options inspired by 15M ideals. We see the numbers for both decrease again for the 2015 and 2016 election, although it slightly raises again in 2019.
Animated GIF